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Abstract. Wc give an algorithm that computes the final state of certain growth 
models without computing all intermediate states. Our technique is based on a 
"least action principle" which characterizes the odometer function of the growth 
process. Starting from am approximation for the odometer, we successively correct 
^ I under- and overestimates and provably arrive at the correct final state. 

Internal difFusion-limitod aggregation (IDLA) is one of the models amenable to 
our technique. The boundary fluctuations in IDLA were recently proved to be at 
most logarithmic in the size of the growth cluster, but the constant in front of the 
0^ logarithm is still not known. As an application of our method, we calculate the size 

of fluctuations over two orders of magnitude beyond previous simulations, and use 
the results to estimate this constant. 

1. Introduction 

^ In this paper we study the abelian stack model, a type of growth process on graphs. 

Special cases include internal diffusion limited aggregation (IDLA) and rotor-router 
^ aggregation. We describe a method for computing the final state of the process, given 

'— ' an initial approximation. The more accurate the approximation, the faster the com- 

^ putation. 

^ IDLA. Starting with N chips at the origin of the two-dimensional square grid 7?, 

O each chip in turn performs a simple random walk until reaching an unoccupied site. 

Introduced by Meakin and Dcutch [32] and independently by Diaconis and Fulton 
[13], IDLA models physical phenomena such as a solid melting around a heat source, 
^ electrochemical polishing, and fluid flow in a Hele-Shaw cell. Lawler, Bramson, and 

Griffeath [27] showed that as iV — t- oo, the asymptotic shape of the resulting cluster 
^-H of occupied sites is a disk (and in higher dimensions, a Euclidean ball). 

^ The boundary of an IDLA cluster is a natural model of a random propagating front 

•'-j (Figure 1, left). From this perspective, the most basic question one could ask is, what 

r> is the scale of the fluctuations around the limiting circular shape? Until recently this 

was a long-standing open problem in statistical physics. It is now known that the 
fluctuations in dimension 2 are of order at most logA^ [3, 24]; however, it is still an 
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Figure 1. IDLA cluster (left) and rotor-router cluster with counterclockwise rotor sequence (right) 
of = 10® chips. Half of each circular cluster is shown. Each site is colored according to the 
final direction of the rotor on top of its stack (yeIlow=W, red=S, blue=E, green=N). Note that the 
boundary of the rotor-router cluster is much smoother than the boundary of the IDLA cluster. 
Larger rotor-router clusters of size up to A*' = 10^" can be found at [1]. 



open problem to show that the fluctuations are at least this large. We give numerical 
evidence that logA^ is in fact the correct order, and estimate the constant in front of 
the log. 

Rotor-router aggregation. James Propp [25] proposed the following way of deran- 
domizing IDLA. At each lattice site in 7? is a rotor that can point north, east, south 
or west. Instead of stepping in a random direction, a chip rotates the rotor at its 
current location counterclockwise, and then steps in the direction of this rotor. Each 
of N chips starting at the origin walks in this manner until reaching an unoccupied 
site. Given the initial configuration of the rotors (which can be taken, for example, to 
be all north), the resulting growth process is entirely deterministic. Regardless of the 
initial rotor configurations, the asymptotic shape is a disk (and in higher dimensions, 
a Euclidean ball) and the inner fluctuations are proved to be O(logA^) [29]. The true 
fluctuations appear to grow even more slowly, and may even be bounded independent 
of N. 

Rotor-router aggregation is remarkable in that it generates a nearly perfect disk in 
the square lattice without any reference to the Euclidean norm (x^ -|- y^)^/^. Perhaps 
even more remarkable are the patterns formed by the final directions of the rotors 
(Figure 1, right). 

Low-discrepancy random stack. To better understand whether it is the regular- 
ity or the determinism which makes rotor-router aggregation so round, we follow a 
suggestion of James Propp and simulate a third model, low-discrepancy random stack, 
which combines the randomness of IDLA and the regularity of the rotor-router model. 

Computing the odometer function. The central tool in our analysis of all three 
models is the odometer function, which measures the number of chips emitted from 
each site. The odometer function determines the shape of the final occupied cluster via 
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a nonlinear operator that we call the stack Laplacian. Our main technical contribution 
is that even for highly non-deterministic models such as IDLA, one can achieve fast 
exact calculation via intermediate approximation. Approximating our three growth 
processes by an idealized model called the divisible sandpile, we can use the known 
asymptotic expansion of the potential kernel of random walk on to obtain an 
initial approximation of the odometer function. We present a method for carrying out 
subsequent local corrections to provably transform this approximation into the exact 
odometer function, and hence compute the shape of the occupied cluster. Our runtime 
depends strongly on the accuracy of the initial approximation. 

Applications. Traditional step-by-step simulation of all aforementioned models in 
requires a runtime of order AT^ to compute the occupied cluster. Using our new 
algorithm, we are able to generate large clusters faster: our observed runtimes are 
about A log A for the rotor-router model and about A^'^ for IDLA. By generating 
many independent IDLA clusters, we estimate the order of fluctuations from circu- 
larity over two orders of magnitude beyond previous simulations. Our data strongly 
support the findings of [33] that the order of the maximum fluctuation for IDLA in 
1? is logarithmic in A. Two proofs of an upper bound C log A on the maximum fluc- 
tuation for IDLA in have recently been announced: see [2, 3] and [24]. While the 
implied constant C in these bounds is large, our simulations suggest that the maximum 
fluctuation is only about 0.528 In A. 

For rotor-router aggregation we achieve four orders of magnitude beyond previous 
simulations, which has enabled us to generate fine-scaled examples of the intricate 
patterns that form in the rotors on the tops of the stacks at the end of the aggregation 
process (Figure 1, right). These patterns remain poorly understood even on a heuristic 
level. We have used our algorithm to generate a four-color 10-gigapixel image [1] of the 
final rotors for A^ = 10^° chips. This file is so large that we had to use a Google maps 
overlay to allow the user to zoom and scroll through the image. Indeed, the degree of 
speedup in our method was so dramatic that memory, rather than time, became the 
limiting factor. 

Related Work. Unlike in a random walk, in a rotor-router walk each vertex serves 
its neighbors in a fixed order. The resulting walk, which is completely deterministic, 
nevertheless closely resembles a random walk in several respects [8-10, 14, 18, 22]. 
The rotor-router mechanism also leads to improvements in algorithmic applications. 
Examples include autonomous agents patrolling a territory [35] , external mergesort [5] , 
broadcasting information in networks [15, 16], and iterative load-balancing [19]. 

Abelian stacks (defined in the next section) are a way of indexing the steps of a 
walk by location and time rather than by time alone. This fruitful idea goes back 
at least to Diaconis and Fulton [13, §4]. Wilson [36] (see also [34]) used this stack- 
based view of random walk in his algorithm for sampling a random spanning tree of 
a directed graph. The final cycle-popping phase of our algorithm is directly inspired 
by Wilson's algorithm. Our serial algorithm for IDLA also draws on ideas from the 
parallel algorithm of Moore and Machta [33] . 

Abelian stacks are a special case of abelian networks [6, 11], also called "abelian 
distributed processors." In this viewpoint, each vertex is a finite automaton, or "pro- 
cessor." The chips are called "messages." When a processor receives a message, it can 
change internal state and also send one or more messages to neighboring processors 



4 



FAST SIMULATION OF LARGE-SCALE GROWTH MODELS 



according to its current internal state. We believe that it might be possible to ex- 
tend our method to other types of abelian networks, such as the Bak-Tang-Wiesenfeld 
sandpilc model [4]. Indeed, the initial inspiration for our work was the "least action 
principle" for sandpiles described in [17]. 

Organization of the paper. After formally defining the abelian stack model in §2, 
we describe the mathematics underlying our algorithm in §3. The main result of §3 
is Theorem 1, which uniquely characterizes the odometer function by a few simple 
properties. In §4 we describe the algorithm itself, and use Theorem 1 to prove its 
correctness. §5 discusses how to find a good approximation function to use as input 
to the algorithm. Finally, §6 describes our implementation and experimental results. 

2. Formal Model 

The underlying graph for the abelian stack model can be any finite or infinite 
directed graph G = {V,E). Each edge e G £^ is oriented from its source vertex s(e) to 
its target vertex t(e). Self-loops (edges e such that s(e) = t(e)) and multiple edges 
(distinct edges e, e' such that s(e) = s(e') and t(e) = t(e')) are permitted. We assume 
that G is locally finite — each vertex is incident to finitely many edges — and strongly 
connected: for any two vertices x,y &V there are directed paths from x to y and from 
y to X. At each vertex x G 1/ is an infinite stack of rotors (/5n(x))„^o- Each rotor 
Pnix) is an edge of G emanating from x, that is, s{pn{x)) = x. We say that rotor 
po{x) is "on top" of the stack. 

A finite number of indistinguishable chips are dispersed on the vertices of G ac- 
cording to some prescribed initial configuration. For each vertex x, the first chip to 
visit X is absorbed there and never moves again. Each subsequent chip arriving at x 
first shifts the stack at x so that the new stack is {pn+i{x))n^o- After shifting the 
stack, the chip moves from x to the other endpoint y = t{pi{x)) of the rotor now on 
top. We call this two-step procedure (shifting the stack and moving a chip) firing the 
site X. The effect of this rule is that the nth time a chip is emitted from x, it travels 
along the edge Pn{x). 

We will generally assume that the stacks are infinitive: for each edge e, infinitely 
many rotors p^(s(e)) are equal to e. If G is infinite, or if the total number of chips 
is at most the number of vertices, then this condition ensures that firing eventually 
stops, and all chips are absorbed. 

We are interested in the set of occupied sites, that is, sites that absorb a chip. The 
abelian property [13, Theorem 4.1] asserts that this set does not depend on the order 
in which vertices are fired. This property plays a key role in our method; we discuss 
it further in §3. 

If the rotors Pn{x) are independent and identically distributed random edges e such 
that s(e) = X, then we obtain IDLA. For instance, in the case G = 7?, we can take 
the rotors Pn{x) to be independent with the uniform distribution on the set of 4 edges 
joining x to its nearest neighbors x it ei, x it e2. The special case of IDLA in which all 
chips start at a fixed vertex o is more commonly described as follows. Let Ai = {o}, 
and for ^ 2 define a random set Aj^ of N vertices of G according to the recursive 
rule 

An+1 =AnU {xn} (1) 
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where a;jv is the cndpoint of a random walk started at o and stopped when it first 
visits a site not in A^. These random walks describe one particular sequence in which 
the vertices can be fired, for the initial configuration of N chips at o. The first chip is 
absorbed at o, and subsequent chips are absorbed in turn at sites xi, . . . , x^-i- When 
firing stops, the set of occupied sites is ^at. 

A second interesting case is deterministic: the sequence Pn{x) is periodic in n, 
for every vertex x. For example, on 1? , we could take the top rotor in each stack 
to point to the northward neighbor, the next to the eastward neighbor, and so on. 
This choice yields the model of rotor-router aggregation defined by Propp [25] and 
analyzed in [28, 29]. It is described by the growth rule (1), where xn is the endpoint 
of a rotor-router walk started at the origin and stopped on first exiting A^. 

3. Least Action Principle 
A rotor configuration on G is a function 

r:V 

such that s{r{y)) = v for all v eV. A chip configuration on G is a function 

a:V 

with finite support. Note we do not require a ^ 0. If a{x) = m > 0, we say there are 
m chips at vertex x; if a{x) = —m < 0, we say there is a hole of depth m at vertex x. 
For an edge e and a nonnegative integer n, let 

Rpie, n) = #{l^k^n\ Pk{s{e)) = e} (2) 

be the number of times e occurs among the first n rotors in the stack at the vertex s(e) 
(excluding the top rotor pQ{s{e))). When no ambiguity would result, we drop the 
subscript p. 

Write N for the set of nonnegative integers. Given a function w: 1/ — )• N, we would 
like to describe the net effect on chips resulting from firing each vertex x e V a total 
of u{x) times. In the course of these firings, each vertex x emits u{x) chips, and receives 
i?p(e, n(s(e))) chips along each incoming edge e with t(e) = x. This motivates the 
following definition. 

Definition. The stack Laplacian of a function u: y — )• N is the function 

Apu: V 

given by 

Apu{x) = ^ Rp{e,u{s{e))) - u{x). (3) 

t{e)=x 

The sum is over all edges e with target vertex t(e) = x. We use the notation Ap to 
emphasize the dependence (via Rp) on the rotor stacks {pk{x))k'~^o- 

Given an initial chip configuration ao, the configuration a resulting from performing 
u{x) firings at each site a; G F is given by 

(7 = (TO + ApU. (4) 

The rotor configuration on the tops of the stacks after these firings is also easy to 
describe. We denote this configuration by Topp(n), and it is given by 

Topp(n)(x) = pn(x){x). 
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Figure 2. An example of rotor stacks p and the stack Laplacian ApU. Here, the underlying graph is 
a path of length 5. For instance, the middle vertex has u{vs) — 2 and Apu{v3) = 1 + 1 — 2 = 0, 
since each of its neighbors V2 and V4 has one rotor between the red lines pointing to V3 . 



We also write E'^p for the collection of shifted stacks: 

= Pu{x)+k{x)- 

The stack Laplacian is not a linear operator, but it satisfies the relation 

Ap{u + v) = ApU + Ae^pV. (5) 
Vertices xi, . . . , Xm form a legal firing sequence for o-q if 

aj{xj+i) > I, j = 0,...,m-l 

where 

cTj = do + ApUj 

and 

Uj{x) = #{i ^j:xi = x}. 

Figure 2 shows an example of rotor stacks and the stack Laplacian. 

In words, the condition aj{xj-^-i) > 1 says that after firing xi, . . . ,Xj, the vertex 
Xj+i has at least two chips. We require at least two because in our growth model, the 
first chip to visit each vertex gets absorbed. 

The firing sequence is complete if no further legal firings are possible; that is, 
o'mix) ^ 1 for all X £ V. If xi, . . . ,Xm is a complete legal firing sequence for the 
chip configuration ctq, then we call the function u := Um the odometer of do. The 
odometer tells us how many times each site fires. 

Abelian Property [13, Theorem 4.1] Given an initial configuration do and 
stacks p, every complete legal firing sequence for ctq has the same odometer function u. 

It follows that the final chip configuration am = do + ApU and the final rotor 
configuration Top^(u) do not depend on the choice of complete legal firing sequence. 

Remark. To ensure that u is well-defined (i.e., that there exists a finite complete legal 
firing sequence) it is common to place some minimal assumptions on p and dQ. For 
example, if G is infinite and strongly connected, then it suffices to assume that the 
stacks p are infinitive. 
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Given a chip configuration ao and rotor stacks {pkix))k^o, our goal is to compute 
the final chip configuration am without performing individual firings one at a time. A 
fundamental observation is that by equation (4), it suffices to compute the odometer 
function u of (Tq. Indeed, once we know that each site x fires u{x) times, we can add up 
the number of chips x receives from each of its neighbors and subtract the u(x) chips 
it emits to figure out the final number of chips at x. This arithmetic is accomplished 
by the term ApU in equation (4); see Figure 2 for an example. In practice, it is usually 
easy to compute ApU given u, an issue we address in §4. 

Our approach will be to start from an approximation of u and correct errors. In 
order to know when our algorithm is finished, the key mathematical point is to find 
a list of properties of u that characterize it uniquely. Our main result in this section. 
Theorem 1, gives such a list. As we now explain, the hypotheses of this theorem can 
all be guessed from certain necessary features of the final chip configuration cr^ and 
the final rotor configuration Topp(«). What is perhaps surprising is that these few 
properties suffice to characterize u. 

Let xi, . . . , Xm be a complete legal firing sequence for the chip configuration uq. We 
start with the observation that since no further legal firings are possible, 

• (Tjn{x) ^ 1 for all X eV. 

Next, consider the set A of sites that fire, which is the support of u: 

A = supp('u) := {x e V: u{x) > 0}. 

Since each site that fires must first absorb a chip, we have 

• (Tm(x) = 1 for all x £ A. 

Finally, observe that for any vertex x E A, the rotor r{x) = Top^(it)(a;) at the top of 
the stack at x is the edge traversed by the last chip fired from x. The last chip fired 
from a given finite subset A' of A must be to a vertex outside of A', so A' must have 
a vertex whose top rotor points outside of A' . 

• For any finite set A' C A, there exists x £ A' with t(r(x)) ^ A' . 

We can state this last condition more succinctly by saying that the rotor configuration 
r = Topp('u) is acyclic on A; that is, the spanning subgraph {V,r{A)) has no directed 
cycles. Here r{A) = {r{x) \ x G A}. 

Theorem 1. Let G be a finite or infinite directed graph, p a collection of rotor stacks 
on G, and a chip configuration on G. Fix "u* : F — >■ N, and let A* = supp('u*). Let 
(7* = (To + ApU^, , and suppose that 

• cr* ^ 1; 

• A^ is finite; 

• cr*(.T) = 1 for all x G A*; and 

• Topp(u*) is acyclic on A^. 

Then there exists a finite complete legal firing sequence for ao, and its odometer func- 
tion is . 

A useful mnemonic for Theorem 1 is "no hills, no holes, no cycles." A hill is a 
site x with a^{x) > 1, and a hole is a site x with cr*(x) < Ixga,- Hills are forbidden 
everywhere (cr* {x) ^ 1 for all x) , but it suffices to forbid holes and cycles only on A^ . 
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We break the proof of Theorem 1 into two inequahties. The first inequality can 
be seen as an analogue for the abelian stack model of the least action principle for 
sandpiles [17, Lemma 2.3]. 

Lemma 2. (Least Action Principle) If ^ 1 and is finite, then there exists 
a finite complete legal firing sequence for uq; and ^ u, where u is the odometer 
function of uq. 

Proof. Perform legal firings in any order, without allowing any site x to fire more than 
u^{x) times, until no such firing is possible. Since A^ is finite, this procedure involves 
only finitely many firings. Write u'{x) for the number of times x fires during this 
procedure. We will show that this procedure gives a complete legal firing sequence, so 
that u' = u. 

Write a' = ctq + ^pu' . If u' ^ 1, then u' = uhy the abelian property. Otherwise, 
choose y such that cr'(y) > 1. We must have u'{y) = u^{y), or else it would have been 
possible to add another legal firing to u' . Therefore, if we now perform — u' further 
firings, then since y does not fire, the number of chips at y cannot decrease. Hence 



contradicting the assumption that cr* ^ 1. 

Lemma 3. Suppose that 

• A* is finite; 

• ait{x) ^ 1 for all x G A^; and 

• Topp(M*) is acyclic on A^. 

Then ^ u. 



□ 



Proof. Let 

m{x) = mm{u{x),u^{x)) 
ip = ao + Apm 
a = ao + ApU. 
Then letting p = E"^p, we have from (5) 

o" = (Jo + ApTfi + Ap{u — m) 
= ijj + Ap(u — to). 

Likewise, cr* = '0 + Ap('u* — m). Let 

A = {xeV\ u*{x) > u{x)}. 

Since u ^ 0, we have A C A^,, hence A is finite. We must show that A is empty. 

Wc have (J^,{x) ^ 1 for all a; G ^4 by hypothesis, while cr{x) ^ 1 by the definition of 
the odometer function u. So 

0^ ^((7,(x)-a(x)) 
^ ^ (Ap(u* - m){x) - Ap{u - m){x)) . 
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For X £ A we have u{x) = m{x), so Ap{u — m){x) ^ 0. Hence 
^ ^ Ap(u* — m) 



= ^ ( - (n*(a;) - m{x)) + ^ #{m(s(e)) < A; ^ n*(s(e)) | pjfc(s(e)) = e} j . 

The terms of the inner sum corresponding to edges e such that s(e) ^ A vanish, since 
in that case m(s(e)) = u*(s(e)). Hence 

^Y^{u^{x) - m{x)) ^ X] X] #{"^(^(^)) < ^ ^ M*(s(e)) | Pfe(s(e)) = e} 

xeA xeAt(e)=x 

s(e)eA 

= ^ ^ #{m{y) <k^ u^y) \ t{pk{y)) = x} 

x&A y^A 

= Yl *{^(y) <k^u,iy)\ t{pkiy)) G A}. (6) 

t/SA 

Now suppose for a contradiction that ^ is nonempty. Since Top^(u*) is acychc on A, 
there exists a site ^ G ^ with t(pfc(2;)) ^ A, where A; = ^^(z). Therefore the sum on 
the right side of (6) is strictly less than J2yeAi''^*iy) ~''^{y))i which gives the desired 
contradiction. □ 

We conclude this section by observing a few consequences of Theorem 1. While 

our algorithm does not directly use the results below, we anticipate that they may be 
useful in further attempts to understand IDLA and rotor-router aggregation. 

The stacks p and initial configuration ao determine an odometer function u = 
u{p,ao), which is the unique function satisfying the hypotheses of Theorem 1. In 
particular, given ao, the function u is completely characterized by properties of the 
chip configuration ctq + ApU and the rotor configuration Top^u. Since permuting the 
stack elements pi{x), . . . , Pu{p,ao){x)-i{^) does not change ApU or Top^-u, we obtain the 
following result. 

Corollary 4. (Exchangeability) Let a be a chip configuration on G. Let {pk{x))x&v,k&n 
and [p']^{x))xev,k&i two collections of rotor stacks, with the property that for each 
vertex x & V, the rotors 

p'lix), . . ■ , Pu{p,a){x)-li^) 

are a permutation of 
Suppose moreover that 
Then u{p' , a) = u{p, a) . 



Pl{x),...,Pu{p,a){x)-l{x). 
Pu{p,a)ix){x) = Pu(p,a){x)i^)- 



Edges ei, . . . , Cm G E form a directed cycle if s(ei+i) = t(ej) for i = 1, . . . , m — 1 
and s(ei) = t{em)- The next result allows us to remove directed cycles of rotors from 
the stacks, without changing the final chip or rotor configuration. 
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Corollary 5. (Cycle removal) Let {pk{x))xev,kef^ be a collection of rotor stacks on G, 
and let {xi,ki) for i = l,...,m he distinct pairs such that the edges {pkS^i)}lXi 
form a directed cycle in G. Let a he a chip configuration on G, and suppose that 
ki ^ u{p, a){xi) — 1 for all i = 1, . . . ,m. Let p' be the rotor stacks obtained from p by 
removing the rotors Pkiixi) for all i = 1, . . . ,m and re-indexing the remaining rotors 
in each stack hy N. Then 

u{p,a) = u{p',a) + x 

where x(a;) = #{1 ^ i ^ m \ xi = x} . Moreover, the final chip and rotor configurations 
agree: 

cr + Ap[u(/9,(7)] = a + Ap/[u(p ,cr)] 

Top^[u(/9,(T)] =Top^,[u(p',(7)]. 

Proof. Let / = u{p,o-). The bound on hi implies that Top^/ = Top^,(/ — x)- By 
Theorem 1 , to complete the proof it suffices to check that Apf = A^/ (/ — x) • For any 
vertex x and edge e with s(e) = x, we have 

Rp{e, fix)) = #{l<^k^ fix) I pkix) = e} 

= R,>ie,fix)-xix)) + cie) 

where c(e) = #{1 ^ i ^ m \ Pkiixi) = e}. Here we have used the fact that the pairs 
ixi,ki) are distinct. Hence 

Apfix) = -fix)+ E RpieJix)) 

t(e)=x 

= -m+ E RAe,fix)-xix))+ E (7) 

t{e)=x t{e)=x 

Since the edges {pkiixi)}^i form a directed cycle, we have J2t{e)=x^i^) ~ 
X]s(e)=x '^(^) — x(^)- So (7) simplifies to Ap/(/ — x)(a;), which shows that Apf = 

^Af-x)- □ 

4. The Algorithm: From Approximation to Exact Calculation 

In this section we describe how to compute the odometer function u exactly, given 
as input an approximation ui. The running time depends on the accuracy of the 
approximation, but the correctness of the output does not. In the next section we 
explain how to find a good approximation ui for the example of N chips started at 
the origin in Z^. 

Recall that G may be finite or infinite, and we assume that G is strongly con- 
nected. We assume that the initial configuration ctq satisfies (To(x) ^ for all x, and 
o"o(x) < oo. If G is finite, we assume that o'oix) is at most the number of ver- 
tices of G (otherwise, some chips would never get absorbed). The only assumption on 
the approximation ui is that it is nonnegative with finite support. Finally, we assume 
that the rotor stacks are infinitive, which ensures that the growth process terminates 
after finitely many firings: that is, Ylxev''^i^) ^ 

For X & V, write 

doutix) = #{e G E I s(e) = x} 
dinix) = #{e e E I t(e) = x} 
for the out-degree and in-degree of x. 
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Figure 3. Classic rotor router aggregation of A'^ = 100, 000 chips with counterclockwise rotor se- 
quence. The pictures show the direction of the rotors on top of the stacks after each step of the 
computation (yellow=W, red=S, blue=E, green=N). 



The odometer function u depends on the initial chip configuration do and on the 
rotor stacks {pk{x))k^o- The latter are completely specified by the function R{e,n) 
defined in §3. Note that for rotor-router aggregation, since the stacks are periodic, 
R{e, n) has the simple explicit form 

R{e, n) = 

where j is the least positive integer such that Pj{x) = e. For IDLA, R{e, n) is a random 
variable with the Binomial(n,p) distribution, where p is the transition probability 
associated to the edge e. 

In this section we take R{e, n) as known. From a computational standpoint, if the 
stacks are random, then determining i?(e, n) involves calls to a pseudorandom number 
generator. We address the issue of minimizing the number of such calls in §6.3. 

Our algorithm consists of an approximation step followed by two error-correction 
steps: an annihilation step that corrects the chip locations, and a reverse cycle-popping 
step that corrects the rotors. 

(1) Approximation. Perform firings according to the approximate odometer, 
by computing the chip configuration ai = (Jq + ApUi. Using equation (3), this 
takes time 0{din{x) + 1) for each vertex x, for a total time of 0{#E + #V). 
This step is where the speedup occurs, because we are performing many 
firings at once: Ylx ^i(^) is typically much larger than -\- ij^V . Return ai. 



n + doutjx) - j 
dout{x) 



(8) 



(2) Annihilation. Start with U2 = ui and cj2 = o"i. x £ V satisfies ct2(x) > 1, 
then we call x a hill. If (T2(x) < 0, or if (T2{x) = and U2{x) > 0, then we call 
X a hole. For each x £ Z'^, 

(a) If X is a hill, fire it by incrementing U2{x) by 1 and then moving one chip 
from X to t(Top(n2)(x)). 
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(a) After odometer approx- (b) After annihilation ((72) (c) After cycle popping (0-3) 
imation (cri) 



Figure 4. Classic rotor router aggregation of A'^ = 100, 000 chips with counterclockwise rotor se- 
quence. The pictures show the number of chips after each step of the algorithm. (Location x is 
colored red if ct'{x) = —1, white if cr'(x) = 0, black if cr'(x) = 1, blue if cr'{x) = 2, green if i7'{x) — 3.) 
Note that there are no locations with i7'{x) < — 1 or (t'(x) > 3, and that no chips move during the 
final cycle-popping phase. 

(b) If X is a hole, unfire it by moving one chip from t(Top(u2)(x)) to x and 

then decrementing U2{x) by one. 
A hill can disappear in one of two ways: by reaching an unoccupied site on 
the boundary, or by reaching a hole and canceling it out. When there are no 
more hills and holes, return U2. 

(3) Reverse cycle-popping. Start with U3 = U2 and 

A'i = {x (^V: U'iix) > 0}. 

If Top (113) is not acyclic on ^3, then pick a cycle and unfire each of its vertices 
once. This may create additional cycles. Update ^3 (it may shrink, since n3 
has decreased) and repeat until Top(n3) is acyclic on A3. Output M3. 
Next we argue that the algorithm terminates, and that its final output ^3 equals the 
odometer function u. Step 2 is simplest to analyze if we first fire all hills, and only 
after there are no more hills begin unfiring holes. In practice, however, we found that 
it is much faster to fire hills and unfire holes in tandem; see §6.1 for the details of our 
implementation. 

At the beginning of step 2, all hills are contained in the set 

S = {xeV: ai{x) > 0}. 

Since ao and ui have finite support, ai = ao + ApUi has finite support, so S is finite. 
Since the total number of chips is conserved, we have 

^CTi(x) = ^aoix). 

The right side is ^ #V by assumption. Therefore if S = V, we must have cri(x) = 1 
for all X €z V; in this case there are no hills or holes, and we move on to step 3. 
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Suppose now that 5 is a proper subset of V. Let 



/i = ^(c7l(x)-l) 



be the total height of the hills. Note that firing a hill cannot increase h. If a given 
vertex fires infinitely often, then since the rotor stacks are infinitive, each of its out- 
neighbors also fires infinitely often; since G is strongly connected, it would follow 
that every vertex fires infinitely often. Thus after firing finitely many hills, a chip 
must leave S. When this happens, h decreases. Thus after finitely many firings we 
reach h = and there arc no more hills. 

Next we begin unfiring the holes. After all hills have been settled, we have U2{x) ^ 
for all X e V. The sum Ylxev'^'^i^) finite, and each unfiring decreases it by one. 
To show that the unfiring step terminates, it suffices to show that for all a; G 
the unfiring of holes never causes U2{x) to become negative. Indeed, suppose that 
U2{x) = and U2{y) ^ for all neighbors y of x. Then the number of chips at x is 
(^o{x) + A.pU2{x) ^ 0, so a; is not a hole. Therefore the unfiring step terminates and 
its output U2 is nonnegativc. 

After step 2 there are no hills or holes, i.e., ^ cr2{x) ^ 1 for all x, and if (72 (x) = 
then U2{x) = 0. 

During step 3, we unfirc sites only within A^. Since '^xev ''-'■'^i-'^) finite and 
decreases with each unfiring, this step terminates and its output is nonnegative. 
When a cycle is unfired, each vertex in the cycle sends a chip to the previous vertex, 
so there is no net movement of chips: as = a2- In particular, there are no hills at the 
end of step 3. If (t-s{x) = 0, then a2{x) = 0; since there were no holes at the end of 
step 2, this means that U2{x) = 0, and hence us^x) = 0. So there are still no holes at 
the end of step 3. By construction, Top(«3) is acyclic on ^3. Therefore all conditions 
of Theorem 1 are satisfied, which shows that U3 = u as desired. 



Next we describe how to find a good approximation to the odometer to use as input 
to the algorithm described in §4. Our main assumption will be that the rotor stacks 
are balanced in the sense that 



for all n G N and all edges e, e' with s(e) = s(e'). By definition, rotor-router aggrega- 
tion obeys the strong balance condition 



IDLA is somewhat less balanced: \R(e,n) — R{e',n)\ is typically on the order of ^/n. 
It turns out that this level of balance is still enough to get a fairly good approximation 
and hence a significant speedup in our algorithm. 

If the rotor stacks are balanced, then the stack Laplacian Ap is well-approximated 
by the operator A on functions u: V —> 7, defined by 



In this setting we can approximate the behavior of our stack-based aggregation with 
an idealized model called the divisible sandpile [29]. Instead of discrete chips, each 



5. Approximating the Odometer Function 



R{e, n) « R{e' ■, n) 



\R{e,n) - R{e,n)\ ^ 1. 
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vertex z has a real- valued "mass" 170(2:). Any site with mass greater than 1 can fire 
by keeping mass 1 for itself, and distributing the excess mass to its out-neighbors by 
sending an equal amount of mass along each outgoing edge. The resulting odometer 
function 

vi^z) = total mass emitted from z 
satisfies the discrete variational problem 

V ^ 

^ 1 - 0-0 (9) 
■u(Ai; - 1 + (Jo) = 0. 

In words, these conditions say that each site emits a nonnegative amount of mass, 
each site ends with mass at most 1, and each site that emits a positive amount of 
mass ends with mass exactly 1. The conditions (9) can be reformulated as an obstacle 
problem, that of finding the smallest superharmonic function lying above a given 
function; see [30]. That formulation shows existence and uniqueness of the solution v. 

If the rotor stacks are sufficiently balanced, we expect the divisible sandpile odome- 
ter function v to approximate closely our abelian stack odometer u. The next question 
is how to compute or approximate v. The obstacle problem formulation shows that v 
can be computed exactly by linear programming. Such an approach works well for 
small to moderate system sizes, but for the sizes we are interested in, the number of 
variables v{z) is prohibitively large. 

Fortunately, for specific examples it is sometimes possible to guess a near solu- 
tion w ^ V. We briefly indicate how to do this for the specific example of interest to 
us, the initial configuration 

(TO = N60 

consisting of N chips at the origin o G Z^. In that case, the set of sites that are fully 
occupied in the final divisible sandpile configuration ao + Av is very close to the disk 

Br = {zeZ'^: \z\ < r} 

of radius r = a/TV/tt; see [29, Theorem 3.3]. Here l^j = (z^ -|- z^)^''^ is the Euclidean 
norm. Thus we are seeking a function it; : ^ M satisfying 

Aw = 1 - NSo in Br 

w ^ on dBj.. 

An example of such a function is 

w{z) = |2p - Na{z) - + Na{{r, 0)) (10) 

where a{z) is the potential kernel for simple random walk {Xn)n^o started at the origin 
in 7?, defined as 

00 

a{z) = inXn = 0)- ¥{Xn = z)) . 
n=l 

Its discrete Laplacian is Aa = Sg- 

As input to our algorithm we will use the function 

w{z)'^ := m.8ix{0, w{z)) 
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where w{z) is given by (10). One computational issue remains, which is how to com- 
pute the potential kernel a{z). The potential kernel has the asymptotic expansion [20, 
Remark 2] 

a{z) = ^ In + ^ + J_ ?44^ + 0{\zn (11) 
TT Dvr \z\^ 

where lo = z/\z\ and k = ^J"^^ ; here 7 « 0.577216 is Euler's constant lini(Y^^=i ^ ~ 
Inn). Note that if 9 is the argument of z, then 

8i4l4 -1 = 8 sin^ 9 cos^ 9-1 

= 2 sin^ 26' - 1 

= sin^ 29 - cos^ 29 

= — cos A9. 

Thus, identifying with Z + iZ C C, we can write 

/ ^ 2 , , , 1 Re(z4) ,4, 
a{z) = - In z +K - r\^ + 0{ z ^ . 



TT 



For z close to the origin the error term 0(|2;|~^) becomes significant. Therefore, we 
use the McCrea- Whipple algorithm [31] (see also [26]) to determine a{z) exactly for 
l^l < 100. This algorithm uses the exact identity 

A IT' -I 

4 ^ 1 

a(n + in) = — > — 

^ ^ TT ^ 2/c - 1 

fe=i 

for ra ^ 0, together with the relation Ao = 60 and reflection symmetry across the real 
and imaginary axes to compute a{z) recursively. The values of a{z) for z GZ + iZ are 
rational linear combinations of 1 and -. 

TT 

Now we can describe the function ui that wc used as input to the first step of our 
algorithm. Let r = N/ix. Approximating the term a((r, 0)) in (10) by f logr + k, 
we set 

ui{z) = [\zf + {2\nr - 1 + TTK - 7ra{z))] , \z\ < 100. 

Here [t] = |_t + |J denotes the closest integer to t G M. For \z\ ^ 100 we use the 
asymptotic expansion for a{z) in (10), which gives 



ui{z) 



r Ke{z 



4> 

Up + r^ f 21n^ - 1+ ^, ,„ 

\z\ blzr 



\z\ ^ 100, (12) 



where t"*" := max(t, 0). Including more terms of the asymptotic expansion of a(z) 
from [26] improves the approximation very slightly, but increases the overall runtime. 

6. Experimental Results 

We implemented our algorithm for three different growth models in : rotor-router 
aggregation, IDLA, and a hybrid of the two which we call "low-discrepancy random 
stack." In this section we discuss some details of the implementation, comment on the 
observed runtime, and present our findings on the fluctuations of the cluster An from 
circularity for large N. 

As a basis for comparison to our algorithm, consider the time it takes to compute 
the occupied cluster An for rotor-router aggregation by the traditional method of 
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(a) iV = 100000. (b) iV = 100100. (c) N = 100200. 



Figure 5. Classic rotor router aggregation with counterclockwise rotor sequence. The pictures show 
the quality of the odometer approximation for different values of N, as measured by the difference 
Ml — u. The site x is colored blue if ui{x) > u{x), red if iii(a::) < u{x), and white if ui{x) = u{x). 
The dramatic dependence on A*' suggests that our approximation ui captures substantially all of 
the large-scale regular structure in u. 

firing one vertex at a time. If zi, . . . , ztv G are the locations of the N chips, define 
the quadratic weight Q{z) = X^^j^ NiPj where K^;,?/)! = (x^ + is the Euclidean 

norm. Firing a given vertex z four times results in exactly one chip being sent to each 
of the four neighbors zzizci, ^±62. The net effect of these four firings on the quadratic 
weight is to increase Q by 

|z-heip + |z-eip + |z + e2p + |2;-e2|^-4|2;p = 4. 

Thus, the total number of firings needed to produce the final occupied cluster Ajy 
is approximately X^zeAjv I'^l^' Since Ajy is close to a disk of area A^, this sum is 
about iVV2vr. 

Traditional step-by-step simulation therefore requires quadratic time to compute 
the occupied cluster. Step-by-step simulation of IDLA also requires quadratic time, as 
observed in [27, 33]. We found experimentally that our algorithm ran in significantly 
shorter time: about A^logA^ for the rotor-router model (Table 1), and about N^'^ for 
IDLA (Table 2). 

6.1. Implementation details. We implemented the described algorithm in C+-I-. 
The source code is available from [1]. It is easy to compute the odometer approximation 
for z with l^l ^ 100 according to equation (12). However, the odometer approximation 
for z with \z\ < 100 is less straightforward as the McCrea- Whipple algorithm [31] is 
numerically very ill-conditioned. In order to avoid escalating errors with fixed precision 
fioating point numbers, we used the computer algebra system Maple to precompute 
a(z) as a rational linear combination of 1 and - for \z\ < 100. 

\ / TV ' ' 

For the annihilation step described in §4, we used a multiscale approach to cancel 
out hills and holes efficiently. More specifically, let Li,L2,... be an exponentially 
growing sequence of integers. For each i ^ 1 do 
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(a) iV = 100000. (b) N = 100100. (c) = 100200. 



Figure 6. Classic rotor router aggregation with counterclockwise rotor sequence. The pictures show 
the quality of the odometer approximation after the annihilation phase, for different values of A'', as 
measured by the difference U2 — u. The site x is colored blue if U2{x) > u{x), white if U2{x) = u{x). 
Note that after annihilation, there are no longer any sites satisfying U2{x) < u{x). The remaining 
odometer difference also shows how many cycles are then popped in the last phase of our algorithm. 
The darker the color, the more cycles run through this location. 

• Substep i: fire each hill / unfire each hole until it either cancels out or reaches 
a site in d := {Li Z xZ) U (Z x L^Z). 

We used Li = 1 and Lj+i = [1.9 Lj] for i ^ 1. Experimentally, the choice of 1.9 
resulted in the fastest run time. During each substep i, we scan the grid and for each 
site z ^ Gi, if z is a hill, fire it until it is no longer a hill; if z is a hole, unfire it until it 
is no longer a hole. We repeat this scanning procedure until no hills or holes remain 
outside of Gi. The result is that a large number of hills and holes meet and cancel 
each other out, while the remainder are swept into the much sparser set Gi. We then 
proceed to substep i + stopping when Li exceeds the diameter of the set of sites that 
absorb a chip. At this stage we perform a final substep with Gi = 0: in other words, 
repeatedly scan the grid, firing hills and unfiring holes with no restrictions on their 
location. When no more hills or holes remain, we proceed to the reverse cycle-popping 
phase described in §4. 

Our rotor-router calculation (§6.2) was performed on a Fujitsu RX600S5 server with 
four Xeon X7550 processors and 2048 GB main memory. Our IDLA calculations (§6.3) 
were performed on a cluster of 96 Sun Fire V20z with AMD Opteron 250 processors. 
For IDLA, our method depends strongly on the availability of a high-quality pseudo- 
random number generator. We used the cryptographically secure generator Advanced 
Encryption Standard (AES) [21], which is the official successor of the well-known Data 
Encryption Standard (DES). We used a key size of 256 bits with the Rijndael cipher 
implementation by Rijmen, Bosselaers and Barreto, which is also part of OpenSSH. 

We observed that C's built-in rand( • ) function, which has a small period, produces 
a noticeably smaller difference between inradius and outradius (about 13% smaller for 
N = 2^^). We did not pursue this further to study whether this difference persists for 
larger values of A^. 
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Number of 
chips N 


Runtime 


Radius Difference 
absolute recentered 


II ... ... II / jvr 

||wi — u\\i/N 


max \ui — u\ 


highest 
hill 


deepest 
hole 


— 

2^''=1,024 


1.60 ms 


1.324 


0.278 


1.800 


6 


3 


-1 


2^^=4,096 


2.58 ms 


1.523 


0.138 


3.370 


10 


3 


-1 


2i''=16,384 


5.71 ms 


1.579 


0.166 


2.417 


12 


3 


-1 


2i''=65,536 


21.5 ms 


1.611 


0.429 


4.461 


17 


3 


-1 


Ol8 0^!0 1/1/1 

z =20Z,i44 


67.1 ms 


1.565 


0.346 


2.919 


16 


3 


-1 


220=1,048,576 


0.26 sec 


1.642 


0.362 


4.323 


23 


3 


-1 


222=4J94,304 


1.04 sec 


1.596 


0.316 


4.220 


29 


3 


-1 


22-*=i6,777,216 


3.53 sec 


1.614 


0.396 


3.974 


45 


3 


-1 


2^6=67,108,864 


0.24 min 


1.658 


0.368 


4.695 


62 


3 


-1 


2^8=268,435,456 


0.98 min 


1.639 


0.340 


4.463 


83 


3 


-1 


230=1,073,741,824 


4.04 min 


1.635 


0.414 


4.309 


91 


3 


-1 


2^2=4,294,967,296 


0.28 liours 


1.650 


0.366 


4.383 


172 


4 


-2 


23^=17,179,869,184 


1.10 liours 


1.688 


0.439 


4.734 


252 


11 


-8 


2-'«!=68,719,476,736 


3.80 liours 


1.587 


0.385 


5.408 


353 


38 


-35 



Table 1. Simulation results for classic rotor-router aggregation with counterclockwise rotor sequence. 
The given runtime is the total runtime of the calculation of one rotor-router aggregation of the given 
size on a Fujitsu RX600S5 server. The next two columns show the difference between the outradius 
and inradius of the occupied cluster Am, measured with respect to the origin ("absolute") and 
with respect to the putative center of mass (§, |) ( "rcccntcrcd" ) . The next two columns give two 
measurements of the error of our odometer approximation wi , the total absolute error and maximum 
pointwise error. In the last two columns, "highest hill" and "deepest hole" refer respectively to 
maxx cri(x) and miux ai{x). 



6.2. Rotor-router aggregation. In the classic rotor-router model, the rotor stack is 
the cyclic sequence of the four cardinal directions in counterclockwise order. Table 1 
shows some statistical data of our computation. The absolute error in our odometer 
approximation 

\\ui — u\\i = \ui{x) — u{x)\ 

X 

appears to scale linearly with A^. This quantity is certainly a lower bound for the 
running time of our algorithm. The measured runtimes indicate close-to-linear runtime 
behavior, which suggests that our multiscale approach to canceling out hills and holes 
is relatively efficient. 

Figure 5 depicts the odometer difference ui{x) — u{x) for three different values of A''. 
Figure 6 depicts the odometer difference U2{x) — u{x) after the annihilation step of 
the algorithm. 

The asymptotic shape of rotor-router aggregation is a disk [28, 29]. To measure 
how close AT is to a disk, we define the inradius and outradius of a set ^ C by 

rin{A) = min{|a;| : x ^ A} 

and 

ToutiA) = max{|x| : x G A}. 

We then define 

diff(Ar) = VoutiAN) - TiniAN). 
A natural question is whether this difference is bounded independent of N. We cer- 
tainly expect it to increase much more slowly than the order log N observed for IDLA. 

Kleber [25] calculated that diff(3 • 10^) ^ 1.6106. We can now extend the measure- 
ment of diff(Ar) up to AT = 2^^ 6.8-10^'^ (Table 1, third column). Our algorithm runs 
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Figure 7. Difference between the inradius and outradius of the rotor-router aggregate, for different 
numbers of chips A''. The single dots are individual values of diff(A'') (left) and diff'(A'^) (right). The 
darker curves show the averages diff(Af) and diff' (N) as defined in equations (13) and (14). (Color 
scheme: WNES=blue, WNSE=green, WENS=red.) 



in less than four hours for this value of N; by comparison, a step-by-step simulation 
of this size would take about 23000 years on a computer with one billion operations 
per second. In our implementation, the limiting factor is memory rather than time. 

Up to dihedral symmetry, there are three different balanced period-4 rotor sequences 
for Z^: WENS, WNSE, and WNES. The notation WENS means that the first four rotors in 
each stack point respectively west, east, north and south. 

Figure 7a shows the radius difference diff(iV) for various for the three different 
rotor sequences. As these values are rather noisy, we have also calculated and plotted 
the averages 

diff(iV):=^^ E difW (13) 



with 

for iV ^ 10^ 



I{N) 



N 3JV 
2 ' 2 



iV-5- 10^iV-^5• 10^] foriV>10'^. 



Note that in Figure 7a, the radius difference diff(A^) grows extremely slowly in N. In 
particular, it appears to be sublogarithmic. 

We observe a systematic difference in behavior for the three different rotor sequences. 
The observed radius differences are lowest for WNSE, intermediate for WNES, and highest 
for WENS. For example, 

{1.034 for WNSE, 
1.623 for WNES, 
1.837 for WENS. 

This difference can be partially explained by considering the center of mass of the 
aggregate. Recall that our convention is "retrospective" (as opposed to "prospective" ) 
rotor notation: that is, the rotor currently on top of the stack indicates where the last 
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chip has gone rather than where the next chip will go. Hence for WNES rotors, the first 
time each site fires it sends a chip north, the next time east, then south, then west. 
As about 1/4 of the sites end up in each of the four rotor states, for WNES rotors about 
half of the sites send one more chip N than S, and (a different but overlapping) half 
send one more chip E than W. As a result, the center of mass of the set of occupied 
sites is close to (1/2, 1/2). For WENS the center of mass is close to (3/4, 1/4), and for 
WNSE it's close to (1/4, 1/4). 

In some sense, a better measure of circularity than diff(A^) is the radius difference 
relative to the center of mass. Thus we define 

diff (iV) = rout{AN - c) - rin{AN - c) 

where c is one of (1/2,1/2), (3/4,1/4), or (1/4,1/4) chosen according to the rotor 
sequence used. Let 



The differences are now significantly smaller, and the two non-cyclic rotor sequences 
WNSE and WENS have nearly the same radius difference for large N. To see why, note that 
WENS is obtained from WNSE by a shift in the stacks (to EWNS) followed by interchanging 
the directions east and west. Thus the observed difference in diff(iV) between these 
two rotor sequences is entirely due to the effect of the initial condition of rotors primed 
to send chips west. By adjusting for the center of mass, we have largely removed this 
effect in diff (A^). 

6.3. Internal Diffusion Limited Aggregation (IDLA). In IDLA, the rotor direc- 
tions Pk{x) for X ^1? and A; G Z are chosen independently and uniformly at random 
from among the four cardinal directions. In the course of firing and unfiring during 
steps 2 and 3 of our algorithm, the same rotor pk{x) may be requested several times. 
Therefore, we need to be able to generate the same pseudorandom value for pfe(x) each 
time it is used. Generating and storing all rotors pk{x) for all x and all 1 ^ A; ^ u\{x) 
is out of the question, however, since it would cost Q{N'^) time and space. 

Moore and Machta [33] encountered the same issue in developing a fast parallel 
algorithm for IDLA. Rather than store all of the random choices, they chose to store 
only certain seed values for the random number generator and generate random walk 
steps online as needed. Next we describe how to adapt this idea to our setting for fast 
serial computation of IDLA. 

The AES pseudorandom number generator takes as input a block of 128 bits and 
"encrypts" it, outputting a block of 128 pseudorandom bits. We interpret the output 
block as the binary expansion of a number in the interval [0, 1). Let rnd{b) be the 
pseudorandom number generated from input block b. Let 



where block{x, k,a) is a simple deterministic function that assumes distinct values for 
each triple (x, k, a) of site x, odometer value 1 ^ k ^ K, and integer 1 ^ a ^ ^. The 




(14) 



These values are plotted for various N in Figure 7b. We find 




0.499 for WNSE and WENS, 
0.338 for WNES. 



Ukix) = rnd{block(x, k, a)) 
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Number of 
chips N 


Average 
Runtime 
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Difference 


IImi -M||i/Ar3/2 

1 1 "'A 1 1 -L / 


max \ui — u\ 


Number 
of runs 


211=1.024 


9.80 ms 


3 1 98+0 569 


490+n 057 


134+27 




10'' 


211=2,048 


26.9 ms 


3.569±0.547 


0.516+0.054 


220+41 




10'' 


212=4,096 


73.9 ms 


3.948±0.553 


0.541+0.051 


355+62 




10^ 


213=8.192 


0.21 sec 


4 307+0 556 


565+0 049 


568+93 




10'' 


21^1=16.384 


0.62 sec 


4.664+0.566 


0.588+0.047 


901+139 




10'' 


215=32.768 


1.81 sec 


5.027+0.578 


0.610+0.045 


1.418+207 




10*' 


2i'5=65,536 


5.29 sec 


5.393+0.578 


0.631+0.043 


2,216+307 




lO'' 


217=131,072 


0.26 min 


5.763+0.584 


0.652+0.042 


3,443+456 




10^ 


2i**=262,144 


0.76 min 


6.125±0.588 


0.673+0.041 


5,317+672 




10^ 


21^=524,288 


2.26 min 


6.493±0.593 


0.692+0.039 


8,179+985 




10^ 


2'^°=1,048,576 


6.74 min 


6.858+0.594 


0.711+0.038 


12,522+1,455 




10^ 


2^1=2,097,152 


0.34 liours 


7.222+0.600 


0.730+0.038 


19,085+2,131 


6 


lO'i 


222=4,194,304 


1.01 liours 


7.596+0.600 


0.748+0.036 


29,007+3,109 


6 


lO'i 


223=8,388,608 


3.95 liours 


7.968+0.601 


0.767+0.036 


44,007+4,471 


3 


103 


224=16,777,216 


14.9 hours 


8.319+0.605 


0.783+0.035 


66,418+6,763 


3 


103 


226=33,554,432 


44.3 hours 


8.699+0.575 


0.801+0.033 


99,667+10,192 


4 


102 



Table 2. Simulation results for IDLA. The given runtime is the total time taken for the calculation 
of one IDLA cluster of the given size on a single core. To fit within 4 GB (8 GB for N = 2^^) main 
memory, we used A = for AT 2^^, A = 2 for iV = 2^^, A = 5 for iV > 2^*. The next column shows 
the difference between the outradius and inradius of the occupied cluster An- The fourth and fifth 
columns give two measurements of the error of our odometer approximation mi , the total absolute 
error and maximum pointwise error. The values shown are averages and standard deviations over 
many independent trials; the last column shows the number of trials. 



integer a is fixed for each run of the algorithm, and A is the total number of runs of the 
algorithm; this way, each run generates an independent IDLA cluster. The bound K 
is chosen to be safely larger than the maximal odometer value ui{o) 2r^ Inr. 
Writing t, i, for the four outgoing edges from site x, we set 

t ifO^C/fe(x)<l/4, 

^ if 1/4 ^ Ukix) < 1/2, 

I if 1/2 ^ ?7fe(a;) < 3/4, ^ ^ 

^ if 3/4 ^ Uk{x) < 1. 

The first step of the algorithm described in §4 is to calculate ai from the odometer 
approximation ui. In this calculation, the definition of R{e,n) given in equation (2) 
involves evaluating pk{x) for all 1 ^ /c ^ n. As this is much too expensive, we instead 
use the fact that R{e,n) is a random variable with the Binomial(n, 1/4) distribution. 
In steps 2 and 3 of the algorithm, we need to sample some individual rotors Pk{x), 
but typically not too many: on the order of ^/ui(x). The distribution of these rotors 
depends on the binomials already drawn. We think of first populating an urn with 
balls of 4 colors corresponding to the directions '[, — J,, When the algorithm asks 
for an individual rotor, we draw a ball at random from the urn using our knowledge 
of how many balls of each color remain. 

This approach works well for small and moderate system sizes, but for large N it 
is too memory-intensive. The memory usage comes from the need to store the rotors 
previously drawn in order to keep track of how many balls of each color remain in the 
urn. Note that keeping a count does not suffice, because the algorithm may request a 
single rotor multiple times. 



Pk{x) := < 
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moment m Re(Af„ (vIat ) )/ 

(a) Sample variance of the first 100 moments. (b) Histograms of the first three mo- 
ments. 



Figure 8. Complex moments of the IDLA cluster. Left: The sample variance V(m) = 
ERe{MmiAN)/VNf of the real parts of the first 100 moments, for iV = 2^°,..., 2^°. As iV 
increases, the variance of the real part of the m-th moment approaches l/(2m -1-2), in agreement 
with the results of [23]. Right: Histogram of the real part of the first three moments for A'' — 2^®. 
The histogram shows 1,000,000 independent runs in bins of size 0.05. Data for the imaginary parts 
is similar. 



Fix a parameter A ^ representing the tradeoff between time and memory. A larger 
value of A will result in saving memory at the cost of additional time. Let 

f[x) = (ui{x) - X^/ui{x)^ . 

For each site x with /(x) > 0, we sample three binomial random variables 

B ~ Binomial(/(x), 
B' ~ Binomial(/(x) -B,^), 
B" ~ Binomial(/(x) -B-B\\). 

We then set 

R{\,f{x)) = B 
Ri^,f{x)) = B' 

R{iJ{x)) = B" 
R{^,f{x)) = f{x)-B-B' -B". 

Next, to implement step 1 of the algorithm described in §4, we need to know 
R{e,ui{x)). So we compute 

R{e, ui{x)) = R{e, f{x)) + < A; ^ mi(x) | pkix) = e}. 
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Figure 9. Difference between inradius and outradius for different numbers of cliips A'^ for IDLA (§6.3) 
and the low discrepancy random stack model (§6.4). Dots indicate means and error bars indicate 
standard deviations of the random variable diff(A'^) over many independent trials. The respective 
data can be found in Tables 2 and 3. 



Note that if A is large, then this calculation is expensive in time, since it involves 
calling the pseudorandom number generator to draw as many as Xy^ui{x) rotors 

Pk{x), f{x) <k ^ ui{x) 

using equation (15). But, crucially, these rotors do not need to be stored. 

During steps 2 and 3 of the algorithm, we sample any rotors pk{x) for k > f{x) 
as needed using (15). Rotors pk{x) for k ^ f{x) can be sampled online as needed 
according to the distribution 

ifC/.(x)G [O,^), 

ifC/.(x)G 

if u (x) G rMM±Mz^M m,k)+R{^,k)+Ra,k) \ 

k\ J ^ fc ' k y ' 

iiUk{x)€ [mm±m^.m±mm),i). 

Initially, the values R{e, k) are known only for k = f{x). We generate the rotors pk{x) 
as needed in order of decreasing index /c, starting with k = f{x). Upon generating a 
new rotor pfc(x) = e, we inductively set 

R{e,k- 1) = R{e,k) - 1 

and R{e',k — 1) = R{e',k) for e' G {t, — J,, -^j — {e}. These values specify the 
distribution for the next rotor pk-i{x) in case it is needed later. 

The results of our large-scale simulations of IDLA are summarized in Table 2, ex- 
tending the experiments of Moore and Machta [33] {N ^ 10^'^^ with 100 trials) to over 
10^ trials for N ^ 2^^ and over 300 trials for N ^ 2^^ ~ lO"^'^. The observed runtime 
of our algorithm for IDLA is about A^^'^; in contrast, building an IDLA cluster of 
size A^ by serial simulation of N random walks takes expected time order A"^ (cf. [33, 
Fig. 3]). 



Pk{x) := < 
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An interesting question is whether the runtime could be reduced further by starting 
from a random odometer approximation ui instead of the deterministic approxima- 
tion ui . One approach is to draw binomials as above (taking A = 0) , and use them to 
define a "warped" Laplacian operator A, given by 

Af{x) = ^^fiy)-fix). 

yr^X y 

Here By = ui{y), and Byx = R{{y , x) , ui{y)) is the binomial associated to the di- 
rected edge {y,x). We then take ui to be the solution to the variational problem (9), 
with A replaced by A. This problem can be formulated as a linear program: minimize 
X^3,Si(a;) subject to the constraints ui ^ and Aui ^ 1 — NSg- One could even 

iterate this construction, using ui to draw new binomials and get a new warping A 

and a new approximation ui. A small number of iterations should suffice to bring 
the approximation very close to the true odometer. The main computational issue 
is how to quickly solve (or even approximately solve) these linear programs, which 
are sparse but quite large: the number of variables is about N. We achieved some 
modest speedup with this kind of approach, but not enough to justify the additional 
complexity. 

To measure the circularity of the IDLA cluster, we computed the complex moments 

M^(^^) = y: (^)" 

zeAN 

for m = 1, . . . , 100. Here r = ^/NJtt, and we view z G Aj<; as a point in the complex 
plane by identifying T? with Z + iZ. These moments obey a central limit theorem [23]: 
Mm(^Ar)/\/iV converges in distribution as — t- oo to a complex Gaussian with vari- 
ance l/(m+l). The distribution of the real part of Mm{Aiq) / y/N is shown in Figure 8. 

The expected value of the difference diff(Ar) between outradius and inradius grows 
logarithmically in N: the data in the third column of Table 2, graphed in Figure 9(a), 
fit to 

Ediff(Ar) = 0.528 ln(A) - 0.457 

with a coefficient of determination of B? = 0.99994. Error bars in figure Figure 9(a) 
show standard deviations of the random variable diff(A). 

Since more than one reader has remarked to us that the straight line fit in Fig- 
ure 9(a) looks "too good to be true," we comment briefly on why we believe it comes 
out this way. The random variable diff(Ar) measures the largest fluctuation of ^jv 
from circularity (over all directions). Very roughly speaking, since we believe the 
fluctuations in different directions are close to indpendent, diff(A) behaves like the 
maximum of many indpendent random variables, which is highly concentrated. Note 
that the size of the standard deviation, represented by the error bars in Figure 9(a), is 
approximately constant: it does not grow with N . This finding is consistent with the 
connection with Gaussian free field revealed in [23] . Indeed, if Mj^ is the maximum of 
the discrete two-dimensional Gaussian free field in an A/" x AT box, then the mean EMjv 
has order log A^, and the sequence of random variables {Mjv — EMjvlAfj.! is tight [7]. 
Therefore it is natural to believe (although still unproved) that the variance of Mjv 
has order 1, and that it remains order 1 if the maximum is taken over the boundary 
of a discrete ball instead of a box. 
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Number of 
chips N 


Average 
Runtime 


Radius 
Difference 


— ■"lli/JV max |mi — ii| 


Number 
of runs 


2io 


=1,024 


3.16 ms 


1.026±0.209 


1.34±0.16 


6.00±0.90 


5 


10^ 




=2,048 


6.21 ms 


1.183±0.180 


1.47±0.17 


6.83±0.94 


5 


10-^ 




=4,096 


12.0 ms 


1.256±0.188 


1.60±0.18 


7.65±1.00 


5 


10-"^ 


213 


=8,192 


23.9 ms 


1 314+0 176 


1.73±0.19 


8.52±1.04 


5 


10^ 


2" 


= 16,384 


49.7 ms 


1.405±0.154 


1.86±0.20 


9.40±1.07 


5 


lO'^ 


215 


=32,768 


fl 1 spr 


1.444±0.154 


1.99±0.21 


10.3±1.1 


5 


10^ 


216 


=65,536 


0.21 sec 


1.522±0.160 


2.11±0.22 


11.2±1.2 


5 


10^ 




=131,072 


0.45 sec 


1.583±0.144 


2.23±0.23 


12.2zbl.2 


5 


10^ 


2i8 


=262,144 




1.646±0.142 


2.35±0.24 


13.2±1.2 


5 


10^ 


2i9 


=524,288 


1.90 sec 


1.694±0.135 


2.46±0.24 


14.1±1.3 


5 


10^ 


220 


=1,048,576 


3.88 sec 


1.753±0.124 


2.59±0.26 


15.1±1.3 


5 


10^ 


221 


=2,097,152 


7.96 sec 


1.808±0.124 


2.73±0.28 


16.2±1.4 


5 


10" 


222 


=4,194,304 


0.27 min 


1.850±0.117 


2.86±0.29 


17.3±1.4 


5 


10* 


223 


=8,388,608 


0.55 min 


1.893±0.114 


2.98±0.30 


18.4±1.4 


5 


10* 


224 


=16,777,216 


1.13 min 


1.942±0.109 


3.11±0.31 


19.4±1.5 


5 


103 


225 


=33,554,432 


2.32 min 


1.983±0.109 


3.25±0.33 


20.6±1.5 


5 


10^ 


226 


=67,108,864 


4.74 min 


2.030±0.106 


3.35±0.32 


21.6±1.5 


5 


10^ 


227 


=134,217,728 


9.72 min 


2.070±0.093 


3.51±0.35 


22.9±1.5 


5 


10^ 


228 


=268,435,456 


0.33 hours 


2.108±0.091 


3.61±0.36 


24.1±1.6 


5 


10^ 



Table 3. Simulation results for low-discrepancy random stack. The given runtime is the total runtime 
of the calculation of one cluster of the given size on one core of am AMD Opteron processor 8222. The 
next column shows the difference between the outradius and inradius of the occupied cluster An- 

The fourth and fifth columns give two measurements of the error of our odometer approximation ui , 
the total absolute error and ma^dmum pointwise error. The values shown are averages and standard 
deviations over many independent trials; the last column shows the number of trials. 



6.4. Low-Discrepancy Random Stack. In the rotor-router model (§6.2), the neigh- 
bors are served in a maximally balaneed manner, while in IDLA (§6.3), the rotor stack 
is completely random. Following a suggestion of James Propp, we examine a model 
which combines both features by using low-discrepancy random stacks. In this model 
the neighbors are served in a similarly balanced manner as in the rotor-router model. 
The rotor stacks consist of blocks of length 4, chosen independently and uniformly at 
random from among the 24 permutations of NESW. Hence the rotor stack is random, 
but still satisfies \R{e,n) — R{e',n)\ < 1 for all n and all edges e and e' such that 
s(e) = s(eO. 

This model can be implemented with our method in the same way as IDLA. Fig- 
ure 9b gives averages and standard deviations for the radius difference diff(iV) up to 

= 2^^ = 268,435,456. In contrast to IDLA, the difference between inradius and 
outradius now grows slower than logarithmically in N, and is not much larger than the 
corresponding difference for the rotor-router model. In fact, the data points Ediff(A^) 
of Table 3 fit to 

Ediff(A^) = 1.018 In In(Ar) - 0.919 

with a coefficient of determination of = 0.998. Of course, it is very hard to 
distinguish empirically between slowly growing functions such as In \n{N) and ^y\I^{N), 
so we cannot be sure of the exact growth rate; among several functions we tried, 
Inln(A^) had the best fit. The very slow growth of diff(A^) for the low-discrepancy 
random stack suggests that the extremely good circularity of the rotor-router model 
is mainly due to its low discrepancy rather than its deterministic nature. 
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7. Further Directions 

We have proved that the abehan stack model can be predicted exactly by correcting 
an initial approximation. In experiments, we found that the correction step is quite fast 
if a good initial approximation is available. It would be interesting to investigate what 
other classes of cellular automata can be predicted quickly and exactly by correcting 
an initial approximation. 

The abelian sandpile model in 1? produces beautiful examples of pattern formation 
that remain far from understood [12, 17]. Using Lemma 2.3 of [17], it should be possible 
to characterize the sandpile odometer function in a manner similar to Theorem 1. In 
this characterization, the recurrent sandpile configurations play a role analogous to 
the acyclic rotor configuration in Theorem 1. The remaining challenge would then be 
to find a good approximation to the sandpile odometer function. 
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